pkg/docker: Add event-based Docker container info caching#1990
pkg/docker: Add event-based Docker container info caching#1990florianl wants to merge 21 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1990 +/- ##
==========================================
+ Coverage 69.23% 69.24% +0.01%
==========================================
Files 320 320
Lines 42323 42460 +137
==========================================
+ Hits 29301 29401 +100
- Misses 11325 11349 +24
- Partials 1697 1710 +13
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
CI Supervisor
|
2195482 to
6ae924c
Compare
Implement in-process caching of Docker container metadata in ContainerStore with event-driven invalidation. When containers die or are destroyed, Docker events trigger immediate cache eviction, eliminating stale metadata and reducing repeated API calls. This reduces CPU consumption in span enrichment pipeline by avoiding redundant Docker API calls. Flamegraph showed ~13.4% CPU in dockerEnricher; caching eliminates most calls for containers that host multiple processes.
6ae924c to
398570b
Compare
MrAlias
left a comment
There was a problem hiding this comment.
The caching direction makes sense, but I found one correctness issue in the invalidation path that looks blocking.
ContainerStore stores cache entries in byID using the shortened 12-character container ID derived from inspectInfo.ID, but the Docker event stream delivers the full container ID in msg.Actor.ID. In the current code that means the invalidation lookup never finds the cached entry, so die / destroy events do not actually evict stale metadata.
I think the fix should be to key invalidation with the full container ID internally, and only keep the 12-character form for exported metadata if that abbreviated form is still desired.
|
Nice! Thank you! In principle LGTM! But I have a question: are the performance tests taken in production/dev scenarios, or are they micro-benchmarks ran with I would expect that a cache at this level is not needed because the |
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
I noticed this issue, when experimenting with OBI and ebpf-profiler in the OTel demo using something like this config: So it's not a micro-benchmark but more a learning from a demo environment. |
MrAlias
left a comment
There was a problem hiding this comment.
One follow-up correctness concern on the shared Docker metadata cache: the new store-level PID cache can outlive the process lifecycle unless it also gets explicit PID-scoped eviction. The container-event invalidation path makes sense, but I think the store still needs a helper for process-deletion paths so a reused PID cannot pick up stale metadata from a previous container. After that, the existing lifecycle handlers in pkg/appolly/discover/docker_decorator.go on EventDeleted and pkg/transform/transform_docker.go on ProcessEventTerminated should call it so the shared cache stays aligned with PID lifetime.
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
MrAlias
left a comment
There was a problem hiding this comment.
Two correctness issues still look unresolved on the current head, and I left one test suggestion to make the remaining PID-lifecycle bug reproducible.
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
f70d6b7 to
2e4dbef
Compare
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
mariomac
left a comment
There was a problem hiding this comment.
LGTM despite I'd still remove the ID string field to ensure data consistency, as it's just a substring of FullID
| @@ -167,3 +206,101 @@ func ContainerMetadata[T ~string](dst map[T]string, ci *ContainerMeta, stringer | |||
| out[stringer(attr.ContainerID)] = ci.ID | |||
There was a problem hiding this comment.
for data consistency, I'd remove ID field and replace it here by attr.ContainerId[:abbreviationLength]
MrAlias
left a comment
There was a problem hiding this comment.
One remaining non-blocking cache-efficiency concern inline.
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
MrAlias
left a comment
There was a problem hiding this comment.
Two follow-up cache lifecycle comments inline. One is an adjacent stale-cache concern in the span path; the other is a current race in the new fullID cache write path.
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
MrAlias
left a comment
There was a problem hiding this comment.
Sorry for the review churn here. I should have caught these cache lifecycle issues in earlier passes, and I appreciate your patience as we get this tightened up. I found two remaining races that can leave stale Docker metadata in the shared store, plus one lower-severity watcher enablement issue.
| ) | ||
|
|
||
| ctxInfo.DockerMetadata = docker.NewStore() | ||
| ctxInfo.DockerMetadata.Start(ctx) |
There was a problem hiding this comment.
Starting the Docker watcher here makes it run for every OBI process as soon as common context is built. In Kubernetes mode, or on hosts where Docker metadata is not otherwise used, watchContainerEvents() still retries Docker initialization once per second even though the Docker decorators later bypass themselves.
I think this should be started lazily only after Docker metadata is actually selected/enabled, or guarded by the same Docker-vs-Kubernetes enablement decision used by the decorators.
There was a problem hiding this comment.
Added K8sInformer.IsKubeEnabled(), like it is used in other places. Hope this is fine.
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
|
It seems like, the failing CI tests are related to #2304. |
|
Hi @florianl, do you mind syncing with main, the test failures have been resolved. |
Summary
Implement in-process caching of Docker container metadata in ContainerStore with event-driven invalidation. When containers die or are destroyed, Docker events trigger immediate cache eviction, eliminating stale metadata and reducing repeated API calls.
This reduces CPU consumption in span enrichment pipeline by avoiding redundant Docker API calls. Flamegraph showed ~13.4% CPU in dockerEnricher; caching eliminates most calls for containers that host multiple processes.
Validation